Machine-learning-based diagnosis of thyroid fine-needle aspiration biopsy synergistically by Papanicolaou staining and refractive index distribution

We developed a machine learning algorithm (MLA) that can classify human thyroid cell clusters by exploiting both Papanicolaou staining and intrinsic refractive index (RI) as correlative imaging contrasts, and evaluated the effects of this combination on diagnostic performance. Thyroid fine-needle aspiration biopsy (FNAB) specimens were analyzed using correlative optical diffraction tomography, which can simultaneously measure both the color brightfield image of Papanicolaou staining and the three-dimensional RI distribution. The MLA was designed to classify benign and malignant cell clusters using color images, RI images, or both. We included 1535 thyroid cell clusters (benign:malignant = 1128:407) from 124 patients. Accuracies of MLA classifiers using color images, RI images, and both were 98.0%, 98.0%, and 100%, respectively. For classification, the color-image classifier relied mainly on nucleus size, whereas the RI-image classifier also used detailed morphological information of the nucleus. We demonstrate that the present MLA and correlative FNAB imaging approach has the potential for diagnosing thyroid cancer, and that complementary information from color and RI images can improve the performance of the MLA.

www.nature.com/scientificreports/
Interestingly, the recent improvements in the performance of MLAs have advanced algorithms for thyroid FNAB, making it possible to classify the given digital medical imaging data more effectively 7,8,12,13 . However, the methods for acquiring digital data from thyroid FNAB specimens and for retrieving the imaging information to be used by MLAs have been poorly studied, despite being important determinants of MLA performance. Most previous studies have used color monolayer images of Papanicolaou-, Giemsa-, and hematoxylin-eosin-stained specimens or morphometric parameters calculated from these images 7,8,13 . The advantages of these images are that they are relatively easy to obtain, clinicians are familiar with them, and they represent the current standard of practice. However, whether these are the best digital data for accurately diagnosing thyroid cancer through MLA remains unclear.
To maximize the advantages of MLAs, high-quality, high-resolution, and high-content images are required 14,15 . This is a prerequisite for correctly assessing the characteristics of suspicious FNABs. The aim of this study was to pursue a higher content of cytopathology end-points and evaluate the potential of diagnoses using standard thyroid FNAB brightfield microscopy images combined with an emerging quantitative phase imaging (QPI) technique. QPI exploits the intrinsic refractive index (RI) distribution of cells and tissues as a quantitative label-free imaging contrast 16,17 . RI images can show features that are complementary and synergistic to those of brightfield microscope-based color images of the same cells or tissues, owing to the differences in imaging methods 18 . RI images provide structural or morphological information on cellular or subcellular structures 17,19-21 , whereas brightfield images of Papanicolaou-stained slides provide molecular-specific information 22 . More importantly, RI is a quantitative and reproducible quantity; it is a physical feature that remains constant regardless of where it is measured. Therefore, RI imaging can yield high-quality images that are less dependent on sample preparation and working conditions 23-25 . In this study, we trained and tested an MLA to distinguish between benign and malignant thyroid cell clusters using digital color and RI images of Papanicolaou-stained thyroid FNAB specimens. Furthermore, we investigated whether the information from RI images could improve the accuracy of the MLA by supplementing information from color images of the same specimens.

Materials and methods
Thyroid cell cluster specimens. We performed a single-center cross-sectional study of thyroid cell clusters obtained via thyroid FNAB from benign or malignant human thyroid nodules. Thyroid FNAB slides produced from July 1, 2020, to December 31, 2020, were selected from the medical database of the institution. A benign case was defined as a case in which the FNAB result was "benign (TBSRTC II)" according to The Bethesda System for Reporting Thyroid Cytopathology (TBSRTC) 26 . A malignant case was defined as a case in which the FNAB result was "suspicious for malignancy (TBSRTC V)" and was confirmed to be papillary thyroid carcinoma using surgical specimens, or in which the result was "malignant (TBSRTC VI)". One Papanicolaou-stained liquid-based cytology smear slide per patient was selected. An expert pathologist reviewed each slide and randomly selected up to 20 thyroid cell clusters per slide. Cell clusters were excluded when (a) they originated from thyroid cancer but did not contain cells with characteristics of malignancy or (b) the quality of the digital images obtained from them was insufficient for analysis.
Image acquisition and processing. For each thyroid cell cluster, one two-dimensional color photograph and one three-dimensional RI tomogram were simultaneously acquired using an optical diffraction tomography (ODT) system equipped with a brightfield image acquisition module. For this study, we built a correlative ODT system by modifying an existing ODT system (HT-2H, Tomocube Inc., Daejeon, Republic of Korea) (Fig. 1a). Three-dimensional RI tomograms were then converted into two-dimensional RI images by projection along the Z-axis, so that the RI-model could share the same structure as the model for color images.
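The projection step can be illustrated with a minimal sketch. It assumes that the projection is a maximum intensity projection (the Discussion refers to a "2D MIP image") and that the tomogram is stored as a nested list indexed as volume[z][y][x]. The function name and the toy RI values are hypothetical and chosen only for illustration; a practical implementation would more likely operate on an array library.

```python
def max_intensity_projection(volume):
    """Collapse a 3D volume indexed as volume[z][y][x] into a 2D image
    by keeping, for every (y, x) position, the maximum value over z."""
    depth = len(volume)
    height = len(volume[0])
    width = len(volume[0][0])
    return [
        [max(volume[z][y][x] for z in range(depth)) for x in range(width)]
        for y in range(height)
    ]


# Toy 2 x 2 x 2 volume of RI values (illustrative numbers, not study data).
toy_volume = [
    [[1.337, 1.340], [1.337, 1.337]],  # z = 0
    [[1.337, 1.337], [1.360, 1.337]],  # z = 1
]
projection = max_intensity_projection(toy_volume)
```

In this toy case each pixel of the projection carries the highest RI value found along its Z column, so structures lying in different planes appear in a single 2D image.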
Because thyroid cell clusters vary in size, it is more efficient to base predictions on fixed-size small regions of interest (patches) extracted from the cluster images. Therefore, each image containing a cluster was divided into numerous 256 × 256-pixel (26.1 μm × 26.1 μm) patches. Each patch overlapped adjacent patches by 128 pixels in one direction. The mean count (pixel value) of the color image was calculated for each patch, and we found that patches with a mean count ≥ 170 generally contained all or part of a cluster. These patches were used as the smallest unit of analysis in this study; patches with a mean count ≥ 170 that contained only background material were included without manual exclusion to increase generality.
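The patch extraction described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's code: the image is a single-channel nested list, the 128-pixel overlap is applied in both directions, partial patches at the image border are discarded, and the function names are hypothetical. The threshold of 170 and the direction of the comparison are taken from the text as written.

```python
def patch_origins(height, width, patch=256, stride=128):
    """Return the (top, left) corners of all full patches that fit in the image.
    A stride of 128 with 256-pixel patches gives a 128-pixel overlap."""
    tops = range(0, height - patch + 1, stride)
    lefts = range(0, width - patch + 1, stride)
    return [(top, left) for top in tops for left in lefts]


def mean_count(image, top, left, patch=256):
    """Mean pixel value of one patch of a single-channel image (nested lists)."""
    total = 0
    for row in image[top:top + patch]:
        total += sum(row[left:left + patch])
    return total / (patch * patch)


def select_patches(image, threshold=170, patch=256, stride=128):
    """Keep the patches whose mean count meets the threshold stated in the text."""
    height, width = len(image), len(image[0])
    return [
        (top, left)
        for top, left in patch_origins(height, width, patch, stride)
        if mean_count(image, top, left, patch) >= threshold
    ]


# Toy 4 x 4 image: two rows of value 200 followed by two rows of value 100.
toy_image = [[200] * 4, [200] * 4, [100] * 4, [100] * 4]
selected = select_patches(toy_image, threshold=170, patch=2, stride=1)
```

With the default settings, a 512 × 512-pixel image yields a 3 × 3 grid of patch origins.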
MLA training and testing. The cluster and patch images were divided into training, validation, and test datasets for the deep learning models such that the proportion of malignant clusters was similar in each dataset. Images generated from one cluster were kept together when dividing the dataset (i.e., all patch images from the same cluster were assigned as a batch to either the training, validation, or test dataset).
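A split of this kind can be sketched as below. The sketch assumes that the split is made at the cluster level, separately for each class, and that patches inherit the assignment of their parent cluster. The split fractions, the random seed, the identifiers, and the function name are illustrative assumptions and are not taken from the study.

```python
import random


def split_clusters(labels, train_frac=0.64, val_frac=0.17, seed=0):
    """Assign each cluster to 'train', 'val', or 'test', stratified by label.

    labels maps cluster_id -> 0 (benign) or 1 (malignant). Whatever is left
    after the training and validation shares goes to the test set.
    """
    rng = random.Random(seed)
    assignment = {}
    for cls in sorted(set(labels.values())):
        ids = sorted(cid for cid, y in labels.items() if y == cls)
        rng.shuffle(ids)
        n_train = round(train_frac * len(ids))
        n_val = round(val_frac * len(ids))
        for i, cid in enumerate(ids):
            if i < n_train:
                assignment[cid] = "train"
            elif i < n_train + n_val:
                assignment[cid] = "val"
            else:
                assignment[cid] = "test"
    return assignment


# Toy data: 50 malignant and 100 benign clusters, three patches per cluster.
labels = {f"c{i}": (1 if i < 50 else 0) for i in range(150)}
assignment = split_clusters(labels)
patch_to_cluster = {f"c{i}_p{j}": f"c{i}" for i in range(150) for j in range(3)}

# Patches follow their parent cluster, so no cluster straddles two datasets.
patch_split = {pid: assignment[cid] for pid, cid in patch_to_cluster.items()}
```

Splitting by cluster rather than by patch avoids overlapping patches of the same cluster appearing in both the training and test datasets.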
The architecture of the MLA comprised two levels: patch-level and cluster-level (Fig. 1b). The detailed structure of the system is described separately (Supplementary Text 1, 2). Briefly, we first trained a patch-level MLA with a convolutional neural network (CNN) architecture (DenseNet-169) on a binary classification task to identify patches extracted from malignant cell clusters. Color images and RI images were used separately to generate two patch-level MLAs (color-model and RI-model) (Fig. 1c). Subsequently, the trained patch-level classification model generated a malignancy prediction heatmap for each cluster. The features of each cell cluster were extracted based on the heatmap, and a final tree-based cluster-level classification model (XGBoost classifier) was trained using these features (Fig. 1d). MLA models were generated using only color images (color-model), only RI images (RI-model), or both types of images together (combined model), and their diagnostic performance was evaluated and compared based on the sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy.
Explanatory analysis. Details of the explanatory analyses are described separately (Supplementary Text 3, 4). Briefly, we used gradient-weighted class activation mapping (Grad-CAM) to interpret the MLA classification process. Grad-CAM highlights the local features at the points where the MLA judges malignancy. Additionally, patch images were grouped based on the prediction score of the patch-level MLA (i.e., how highly the MLA judged the probability that a given patch was extracted from a malignant cluster) or using t-distributed stochastic neighbor embedding (t-SNE) analysis. In each group, the size of the nucleus and the degree of detail of the images around the nucleus were evaluated. The degree of detail of the images was quantitatively evaluated using the Brenner gradient.
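For orientation, the Brenner gradient in its commonly used form is the sum of squared differences between pixels two positions apart along one image axis; sharper, more detailed images give larger values. The sketch below implements this common horizontal form on a single-channel nested list. The exact variant, direction, and any normalization used in the study are described in the supplementary material and are not assumed here; the function name and the toy values are hypothetical.

```python
def brenner_gradient(image):
    """Sum over all rows of (I[x + 2] - I[x]) ** 2 for a single-channel image
    stored as nested lists; larger values indicate sharper detail."""
    total = 0.0
    for row in image:
        for x in range(len(row) - 2):
            diff = row[x + 2] - row[x]
            total += diff * diff
    return total


# A sharp edge scores higher than a blurred version of the same edge.
sharp_edge = [[0, 0, 10, 10]]
blurred_edge = [[0, 3, 7, 10]]
sharp_score = brenner_gradient(sharp_edge)
blurred_score = brenner_gradient(blurred_edge)
```

In this toy comparison the sharp edge gives a larger score than its blurred counterpart, which is the property that makes the measure usable as an index of image detail.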
Ethics statement. This study was conducted according to the guidelines of the Declaration of Helsinki and was approved by the institutional review board of the National Cancer Center (IRB number: NCC2020-0126), which waived the requirement for informed consent for this study.

Results
Patients and specimens. Overall, 1,535 thyroid cell clusters obtained from 124 patients were included in this study (Table 1, and Supplementary Table 1). The numbers of benign and malignant clusters were 1,128 (73.5%) and 407 (26.5%), respectively. Cell clusters were divided into training (n = 988), validation (n = 261), and test (n = 286) datasets, and the ratio of the benign and malignant clusters in each dataset was maintained similar to the ratio in the entire dataset.
Model performance. After training of the patch-level classification models, the color-model showed an accuracy of 0.975, which was better than that of the RI-model (0.937) (Table 2 and Supplementary Fig. 1). False negatives accounted for a considerable share (3.75% of the total count) of the overall classification results of the patch-level RI-model. Most false negatives were attributable to patches with noise or artifacts caused by the staining process. In the cluster-level classification, the combined model using information from both types of images showed an accuracy of 1.000 (perfect classification of benign and malignant clusters), which was higher than that of the models using only a single imaging modality (0.980 for the color-model and 0.980 for the RI-model).
We also conducted experiments with other cluster-level classification models, including Random Forest, Support Vector Machine, and Multi-layer Perceptron. These experiments confirmed that the performance and the relationship between color and RI images were robust across models. The model performances are summarized in Supplementary Table 3.
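The performance measures used here follow the standard definitions based on the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), with malignancy taken as the positive class. The sketch below computes them; the function name is hypothetical, the counts in the example are invented purely for illustration and are not results of this study, and all denominators are assumed to be nonzero.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard binary diagnostic measures from confusion-matrix counts.
    Assumes every denominator is nonzero."""
    return {
        "sensitivity": tp / (tp + fn),          # malignant correctly detected
        "specificity": tn / (tn + fp),          # benign correctly detected
        "ppv": tp / (tp + fp),                  # predicted malignant that are malignant
        "npv": tn / (tn + fn),                  # predicted benign that are benign
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts for illustration only (not data from this study).
example = diagnostic_metrics(tp=45, fp=5, tn=95, fn=5)
```

Because benign clusters outnumber malignant ones in this dataset, accuracy alone can mask errors in the minority class, which is why sensitivity, specificity, PPV, and NPV are reported alongside it.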

Gradient-weighted class activation mapping.
Grad-CAM results of the selected patches are presented in Fig. 2. The overlay image showed that the color-model and RI-model focused on distinct areas for the same specimen. Activation of the color-model mainly appeared in large-sized nuclei, indicating that a patch is highly likely to be classified as a malignancy in the presence of large-sized nuclei. In contrast, the RI-model showed high activation in the nuclei with high image gradients and relatively clear intranuclear structures.
Image characteristics according to model prediction scores. Patch images were visualized and analyzed according to the prediction score of the patch-level model to demonstrate the differences in trends between the color-model and RI-model (Fig. 3). The correlation between the size of the nuclei and the prediction score was prominent in the color-model (the larger the nucleus, the higher the probability of malignancy) but was less pronounced in the RI-model.
The degree of detail of the images surrounding the nuclei, quantified using the Brenner gradient, was high when the prediction score of the RI-model was either very high (0.8-1.0) or very low (0.0-0.2), that is, when the model confidence was high. This finding indicates that the more detailed the shape around the nucleus, the more clearly the cells could be distinguished, whether benign or malignant, and that the RI-model performed classification by focusing on the detailed structure of the nuclei. In contrast, the relationship between the prediction score and the Brenner gradient was not obvious for the color-model.
T-distributed stochastic neighbor embedding analysis. t-SNE analysis was performed for the patch-level models to observe the patch grouping of each model (Fig. 4). t-SNE analysis of the color-model led to grouping according to nucleus size (Fig. 4a). In the t-SNE analysis of the RI-model, grouping according to nucleus size was still observed, but the group boundaries were more ambiguous than those of the color-model (Fig. 4b). In many cases, viewing the detailed structure of the patch was difficult when the sample was in the boundary region of the t-SNE plot of the RI-model. However, when the color-model and RI-model were used together, the benign and malignant groups were more distinctly separated (Fig. 4c).

Discussion
In this study, a combination of RI image data and color Papanicolaou-stained image data improved the accuracy of the MLA for diagnosing cancer using thyroid FNAB specimens. The classification results of the MLA using color Papanicolaou-stained images were highly dependent on the size of the nucleus, but those of the MLA using RI images were less dependent on nucleus size and were affected by information around the nuclear membrane. The final algorithm using data from both types of images together distinguished thyroid cell clusters from benign thyroid nodules and papillary thyroid carcinoma (PTC) with 100% accuracy.
MLA has shown superior diagnostic performance using images of thyroid FNAB specimens when a convolutional neural network (CNN) architecture, which is effective for image analysis, was adopted 7,8,12,13 . Guan et al. 13 studied a CNN-based MLA for classifying hematoxylin-eosin-stained FNAB specimens of benign thyroid nodules and PTC (TBSRTC II, V, and VI). A total of 887 fragmented color images were used in their study, which were cropped from 279 images taken using a digital camera attached to a brightfield microscope. The trained algorithm exhibited 97.7% accuracy in distinguishing between 128 test images of benign and malignant nodules. Range et al. 8 used an MLA to classify Papanicolaou-stained FNAB specimens of a broader spectrum of thyroid nodules (TBSRTC II-VI). They used 916 color images obtained using a whole slide scanner. The trained MLA distinguished malignant from benign nodules with high accuracy (90.8%), comparable to that of a pathologist. Similarly, a CNN-based MLA performed well in our study, exhibiting high-accuracy patch-level classification (97.5%) and cluster-level classification (98.0%) using only color Papanicolaou-stained images.
However, given that the purpose of FNAB is to determine whether to operate on thyroid nodules, it must not only exhibit high overall accuracy, but also minimize serious misclassification, such as classification of an obvious malignancy as benign or that of an overtly benign nodule as a malignancy. In Guan's study, MLA misclassified some cases that a pathologist classified as obviously benign as a malignancy. Similarly, in Range's study, MLA misclassified some clearly benign nodules as malignant or misclassified a malignant nodule that was indicated for surgery as benign 8 . These issues are problematic because they can lead to an erroneous treatment plan for patients who would receive proper treatment if they underwent the current standard care. We studied nodules with relatively distinct benign or malignant characteristics (TBSRTC II, V, and VI). Our findings that RI data improved the accuracy of MLA in these nodules have important clinical significance since these indicate a potential reduction in the aforementioned serious misclassification.
Guan et al. 13 suggested that the significant misclassifications of MLA for thyroid FNAB specimens could be related to nucleus size. In their study, the cells in false-positive cases showed large nuclei with high mean pixel color values, similar to malignant cells, but the pathologist determined that these cells had a typically benign morphology. The authors interpreted that the classification of the MLA was based on the size and staining of the nucleus, but not on its shape. Similarly, in our results, the MLA based on color images showed limitations in accurately classifying benign thyroid cells with a large nucleus or malignant thyroid cells with a small nucleus because the size of the nucleus was the main feature used for classification. However, MLA classification based on the RI image was less affected by nucleus size. This suggests that RI images can compensate for the limitations of MLA using color images for FNAB specimens whose nuclear size is not typical for benign or malignant cells. Further results from the explanatory analyses suggest that the RI-image-based MLA uses the structure and shape of the nucleus for classification. Whereas the algorithm was activated mainly by large nuclei in color images, it was activated not only by large nuclei but also by nuclei with a clear structure in RI images. The certainty of the MLA classification results was proportional to the detail of the information around the nuclear membrane when based on RI images, but not when based on color images. Detailed nuclear structures, such as nuclear membrane irregularity and micronucleoli, are important indicators in thyroid cancer diagnosis 26 . Thus, the accuracy of MLA classification can be improved when such information is incorporated.
Another potential strength of RI images is the integration of information from a wide vertical space. In a thyroid cytology specimen, cells are scattered over a wide vertical space (i.e., multiple z-planes) rather than over a single plane. A single-layer (z-plane) 2D image cannot address this vertical spread, and information from out-of-focus cells is likely to be lost or distorted. In contrast, in the RI image obtained through ODT, cells located in different z-planes are in focus simultaneously. In our study, the MLA based on color images showed a false-positive result for some out-of-focus patches, whereas the MLA based on RI images showed a true-negative result for the same image patches (data not shown). However, the out-of-focus area is only a part of the color images, and the use of multiple z-plane images did not improve the accuracy of MLA when compared to the use of a single z-plane image in a previous study 8 . Therefore, it is unclear whether the aforementioned factor significantly affects the accuracy of MLA.
[Figure caption fragment: Patches grouped into five intervals of 0.2 points of the classification score, with representative images, mean nuclear area, and mean Brenner gradient. Red and blue boxes mark patches from malignant and benign clusters, respectively. The mean nucleus area and mean Brenner gradient were calculated using 30 randomly chosen samples for each interval.]
This study has certain limitations. Despite the large number of sample measurements, this study was performed in a single center and could not cover all specimen conditions that could exist in real clinical environments. ODT provides optimal RI imaging in un-manipulated living cells 27 , but we obtained RI images from chromatically stained cells. Staining acted as extrinsic noise or artifact in the RI images, which reduced the accuracy of the MLA. Further study is required to determine the effect of staining on the outcomes.
Finally, up to 30% of FNABs may have "indeterminate" cytopathology (TBSRTC III and IV). This study targeted specimens characteristic of benign or malignant thyroid nodules (TBSRTC II, V, and VI); therefore, the currently trained algorithm cannot be directly applied to TBSRTC III and IV specimens without relevant training.
To investigate the complementary nature of RI images and color images, a 2D maximum intensity projection (MIP) image was generated by projecting the 3D RI image along the z-axis, thereby excluding the influence of dimensionality. Previous studies in the field of cell classification have demonstrated improved performance when using 3D RI images compared to 2D images 28,29 . Although our research did not incorporate 3D images due to the specific research objectives, we plan to expand our investigations in future studies by incorporating 3D RI images and other 3D imaging modalities.
In this study, we demonstrated the efficacy of multiplexing of RI with standard brightfield imaging using a single ODT platform for MLA-based classification of benign and malignant thyroid FNABs. Multiplexed ODT showed promise for the development of a more accurate classification of thyroid FNABs while reducing the inherent uncertainty and error observed in the current diagnostic standards. Thus, an ODT-based MLA may potentially contribute to an improved cost-effective and rapid point-of-care management of thyroid malignancies.

Data availability
The datasets used and/or analysed during the current study are available from the corresponding authors on reasonable request.